Predicting how the stock market will perform is one of the most difficult things to do. There are many factors involved in the prediction – physical and psychological factors, rational and irrational behaviour, and so on. All these aspects combine to make share prices volatile and very difficult to predict with a high degree of accuracy.
The S&P 500 is widely regarded as the best single gauge of large-cap U.S. equities. The index includes 500 leading companies and captures approximately 80% coverage of available market capitalization.
We’ll look at the S&P 500, an index of the largest US companies. The S&P 500 is an American stock market index based on the market capitalization of 500 large companies with common stock listed on the NYSE or NASDAQ.
I will load the datasets for all 500 S&P 500 constituents and apply portfolio optimization to find the stocks with higher return and lower risk, and use machine learning to predict the investment trend of the S&P 500 index.
What are the top 20 stocks by monthly return among the 500 stocks in the S&P 500, found by mathematical programming? The target is to find the most valuable stocks – those with higher return and lower risk.
Should I invest in these top 20 stocks now, based on a machine-learning analysis of the S&P 500 index trend? The aim is to decide whether to invest by choosing the most accurate model of the trend.
Automated trading without human involvement will be the trend of the stock market in the near future. I would like to use data science methods to build an investment strategy.
I will study which machine learning method is more accurate and suitable for prediction, using the root-mean-squared error, so that the predictions are more meaningful in use.
First of all, I will construct a portfolio optimization in order to achieve the maximum expected return for a given risk preference, since the returns of a portfolio are greatly affected by the relationship between assets and their weights in the portfolio.
The stocks with the top 20 monthly returns will be fed into the portfolio optimization. To obtain stocks with higher return and lower risk, the optimization will identify the best investment choices and generate visualizations of returns and volatility.
For the next part, I will work with historical data on the S&P 500 price index to understand whether I can invest in the market at this moment. I will implement a mix of machine learning algorithms to predict the future value of the index, starting with simple algorithms like averaging and linear regression, and then moving on to advanced techniques like Auto ARIMA and LSTM.
I will compare the models using root-mean-squared error (RMSE), which measures how well a model performed via the difference between predicted values and actual values.
Stock market analysis is divided into two parts – Fundamental Analysis and Technical Analysis.
Fundamental Analysis involves analyzing the company’s future profitability on the basis of its current business environment and financial performance. Technical Analysis, on the other hand, includes reading the charts and using statistical figures to identify the trends in the stock market.
We’ll scrape all S&P 500 tickers from Wikipedia, load all 500 datasets, clean them, and append the adjusted closing prices from 2008 to 2018.
Moving Average - The predicted closing price for each day will be the average of a set of previously observed values. Instead of using the simple average, we will be using the moving average technique which uses the latest set of values for each prediction.
Linear Regression - The most basic machine learning algorithm that can be implemented on this data is linear regression. The linear regression model returns an equation that determines the relationship between the independent variables and the dependent variable.
K-Nearest Neighbors - Another interesting ML algorithm that one can use here is kNN (k-nearest neighbours). Based on the independent variables, kNN finds the similarity between new data points and old data points.
ARIMA - ARIMA is a very popular statistical method for time series forecasting. ARIMA models take into account the past values to predict the future values.
Long Short Term Memory (LSTM) - LSTMs are widely used for sequence prediction problems and have proven to be extremely effective.
The S&P 500 or Standard & Poor's 500 Index is a market-capitalization-weighted index of the 500 largest U.S. publicly traded companies. The index is widely regarded as the best gauge of large-cap U.S. equities.
The S&P 500 uses a market capitalization weighting method, giving a higher percentage allocation to companies with the largest market capitalizations. The market capitalization of a company is calculated by taking the current stock price and multiplying it by the outstanding shares.
Reference:
Time series is a collection of data points collected at constant time intervals. These are analyzed to determine the long term trend so as to forecast the future or perform some other form of analysis. But what makes a TS different from say a regular regression problem? There are 2 things:
1. It is time dependent. So the basic assumption of a linear regression model that the observations are independent doesn’t hold in this case.
2. Along with an increasing or decreasing trend, most TS have some form of seasonality, i.e. variations specific to a particular time frame. For example, if you look at the sales of a woolen jacket over time, you will invariably find higher sales in the winter seasons.
Reference:
Modern portfolio theory (MPT) provides investors with a portfolio construction framework that maximizes returns for a given level of risk, through diversification. MPT reasons that investors should not concern themselves with an individual investment’s expected return, but rather the weighted average of the expected returns of a portfolio’s component securities as well as how individual securities move together. Markowitz consequently introduced the concept of covariance to quantify this co-movement.
It proposed that investors should instead consider variances of return, along with expected returns, and choose portfolios offering the highest expected return for a given level of variance. These portfolios were deemed “efficient.” For given levels of risk, there are multiple combinations of asset classes (portfolios) that maximize expected return. Markowitz displayed these portfolios across a two-dimensional plane showing expected return and standard deviation, which we now call the efficient frontier.
Reference:
Monte Carlo simulations are used to model the probability of different outcomes in a process that cannot easily be predicted due to the intervention of random variables. It is a technique used to understand the impact of risk and uncertainty in prediction and forecasting models.
The technique was first developed by Stanislaw Ulam, a mathematician who worked on the Manhattan Project. After the war, while recovering from brain surgery, Ulam entertained himself by playing countless games of solitaire. He became interested in plotting the outcome of each of these games in order to observe their distribution and determine the probability of winning. After he shared his idea with John Von Neumann, the two collaborated to develop the Monte Carlo simulation.
Reference:
Smoothing is a technique applied to time series to remove the fine-grained variation between time steps.
The hope of smoothing is to remove noise and better expose the signal of the underlying causal processes. Moving averages are a simple and common type of smoothing used in time series analysis and time series forecasting.
Calculating a moving average involves creating a new series where the values are comprised of the average of raw observations in the original time series.
Reference:
Linear regression is a very simple approach for supervised learning. Though it may seem somewhat dull compared to some of the more modern algorithms, linear regression is still a useful and widely used statistical learning method. Linear regression is used to predict a quantitative response Y from the predictor variable X.
Linear regression assumes that there is a linear relationship between X and Y.
Reference:
K-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:
In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.
Reference:
An ARIMA model is a class of statistical models for analyzing and forecasting time series data.
It explicitly caters to a suite of standard structures in time series data, and as such provides a simple yet powerful method for making skillful time series forecasts.
ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a generalization of the simpler AutoRegressive Moving Average and adds the notion of integration.
This acronym is descriptive, capturing the key aspects of the model itself. Briefly, they are:
Each of these components are explicitly specified in the model as a parameter. A standard notation is used of ARIMA(p,d,q) where the parameters are substituted with integer values to quickly indicate the specific ARIMA model being used.
The parameters of the ARIMA model are defined as follows:
Reference:
Prophet is a procedure for forecasting time series data. It is based on an additive model where non-linear trends are fit with yearly and weekly seasonality, plus holidays. It works best with daily periodicity data with at least one year of historical data. Prophet is robust to missing data, shifts in the trend, and large outliers.
Prophet is open source software released by Facebook's Core Data Science team.
The Prophet procedure is an additive regression model with four main components:
Reference:
Long Short-Term Memory usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is their default behavior.
All recurrent neural networks have the form of a chain of repeating modules of a neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer. LSTMs also have this chain like structure, but the repeating module has a different structure. The key to LSTMs is the cell state which is acting like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to flow along it unchanged.
Reference:
This project is separated into four parts of analysis: data exploration and visualization; correlation and monthly-return extraction by mathematical programming; portfolio optimization; and machine learning.
The strategic analysis is as follows:
Use BeautifulSoup to grab the stock symbols for the S&P 500.
Use the API in the "fix_yahoo_finance" package to load datasets for all 500 stocks from January 2008 until now; each stock symbol is used to load its own CSV dataset file.
Select each stock's adjusted closing price, use the "join" function, and rename the columns to create the joined CSV.
Use both "plotly" and "matplotlib" for data visualization; "iplot" can be used to compare prices in detail over a certain period of time.
Use the corr() function to find the top 30 stocks out of the 500 with the highest correlation to the S&P 500 index.
The last part of this project conducts machine-learning prediction for the S&P 500 index, so the correlation between stocks and index is an important reference for how the trend is affected by index fluctuation. It will also provide buy and sell signals when predictions are generated.
After getting the top 30 stocks with the highest correlation to the index, we will calculate and compare the monthly returns of these 30 stocks.
Calculate the top 10 stocks by monthly return from the above 30, then sort and extract the joined dataset for the next part, portfolio optimization.
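As a minimal sketch of the monthly-return ranking step, using a small synthetic joined table of adjusted closes (the tickers and prices below are illustrative, not project data):

```python
import numpy as np
import pandas as pd

# Hypothetical joined table of adjusted closes, one column per ticker
idx = pd.date_range("2018-01-01", periods=365, freq="D")
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0.0005, 0.01, (365, 3)), axis=0)),
    index=idx, columns=["AAA", "BBB", "CCC"])

# Month-end prices -> month-over-month percentage returns
monthly_ret = prices.resample("M").last().pct_change().dropna()

# Rank tickers by mean monthly return; the project keeps the top 10
top = monthly_ret.mean().sort_values(ascending=False).head(2)
```

The sorted ranking is what feeds the next stage: the columns of the joined dataset corresponding to `top.index` are extracted for portfolio optimization.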
The top 10 stocks from part 2 were selected using highest correlation and monthly return, without considering risk; at this stage we use the portfolio optimization method to decide the investment strategy.
Portfolio optimization is applied to the top 10 stocks by monthly return, treating them as a medium- or long-term investment, to find an allocation with higher monthly return and lower risk as the investment strategy.
Once the portfolio optimization analysis gives us the proportions of the investment strategy, we will want to know whether now is the time to invest, or when the best moment for buying or selling is. Therefore, machine learning for the S&P 500 index will be conducted by comparing different methods and models.
The process is as follows:
- Dropping NAN with dropna in **pandas**
- splitting into train and validation
- Measuring root mean square error (RMSE) for the standard deviation of residuals
- Plot the prediction
1. Moving average
2. Linear Regression
- Using the function of **linear_model** in **sklearn**
3. k-Nearest Neighbours
- Using the function of **neighbors, GridSearchCV, MinMaxScaler** in **sklearn** to find the best parameter
4. Auto ARIMA
- Using the function **auto_arima** in **pmdarima**, which automatically selects the combination of (p, d, q) that gives the least error.
5. Prophet
- Using **Prophet** in **fbprophet**. Prophet, designed by Facebook, is a time series forecasting library; its input is a dataframe with two columns: date and target.
6. Long Short Term Memory
- Using the function of **MinMaxScaler in sklearn** and **LSTM in keras** to create and fit the LSTM network
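As a sketch of the preprocessing in step 6, the MinMaxScaler scaling and windowing might look like the following (the keras LSTM fit itself is omitted here; `make_lstm_windows` and the synthetic price series are illustrative, not project code):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def make_lstm_windows(prices, lookback=60):
    """Scale a 1-D price series to [0, 1] and slice it into
    (lookback, 1) input windows with next-step targets,
    the usual input shape for a keras LSTM layer."""
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled = scaler.fit_transform(np.asarray(prices, dtype=float).reshape(-1, 1))
    X, y = [], []
    for i in range(lookback, len(scaled)):
        X.append(scaled[i - lookback:i, 0])  # the last `lookback` values
        y.append(scaled[i, 0])               # the next value to predict
    X = np.array(X).reshape(-1, lookback, 1)  # samples x timesteps x features
    return X, np.array(y), scaler

# Synthetic example: 100 fake closing prices, windows of 10
X, y, scaler = make_lstm_windows(np.linspace(100, 200, 100), lookback=10)
```

The returned `scaler` is kept so that predictions can later be inverse-transformed back to price units.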
In order to get roughly 10 years of data for all 500 stocks, from January 2008 until now (April 20, 2019), the BeautifulSoup method was first used to scrape the dataset from the Nasdaq website (http://www.nasdaq.com/symbol/), but it turned out that the page "id" changes automatically after a certain period, so it is not a stable way to get the dataset for this research project; the Yahoo Finance API is the better way to load a large range of data.
Stock symbols are scraped with "BS4 - BeautifulSoup" from Wikipedia (http://en.wikipedia.org/wiki/).
The S&P 500 index symbol is added to the list for measuring correlation with the stocks.
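A minimal sketch of the ticker scrape: the HTML below is a tiny stand-in for the Wikipedia constituents table (in the project the HTML would come from an HTTP request to that page), but the parsing logic is the same:

```python
from bs4 import BeautifulSoup

# Stand-in for the Wikipedia "List of S&P 500 companies" table
html = """
<table class="wikitable" id="constituents">
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td>MMM</td><td>3M</td></tr>
  <tr><td>AAPL</td><td>Apple Inc.</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable"})
# Skip the header row, take the first cell (the ticker) of each data row
tickers = [row.find("td").text.strip() for row in table.find_all("tr")[1:]]
tickers.append("^GSPC")  # add the S&P 500 index symbol for the correlation step
```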
Before we analyze stock data, we need to get it into some workable format. Stock data can be obtained from Yahoo! Finance, Google Finance, or a number of other sources. These days I recommend getting data from Yahoo! Finance, a provider of community-maintained financial and economic data. Yahoo! Finance used to be the go-to source for good quality stock data.
Then the scraped stock symbols are fed to the API in the "fix_yahoo_finance" package, which loads the datasets for all 500 stocks from January 2008 until now.
We have the list of companies in the S&P 500, and now we're going to pull pricing data for all of them.
Process for data preparation:
After getting the tickers:
Then datetime is used to specify dates from 2008 until now for the pandas datareader; os is used to check for, and create, directories.
The dataset for each stock is downloaded with the pandas datareader, and "to_csv" generates a CSV file per stock.
All the CSV stock data files are compiled by taking the adjusted close price, renaming each column to its stock ticker, and exporting to a "joined" CSV file.
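The joining step can be sketched as follows; the two per-ticker frames stand in for the CSVs downloaded by pandas-datareader (tickers and prices are illustrative):

```python
import pandas as pd

# Hypothetical per-ticker frames as downloaded by pandas-datareader;
# only the Adj Close column is kept and renamed to the ticker
idx = pd.date_range("2018-01-02", periods=5, freq="D")
frames = {
    "AAPL": pd.DataFrame({"Adj Close": [170, 171, 169, 172, 173]}, index=idx),
    "MSFT": pd.DataFrame({"Adj Close": [85, 86, 84, 87, 88]}, index=idx),
}

joined = pd.DataFrame(index=idx)
for ticker, df in frames.items():
    col = df[["Adj Close"]].rename(columns={"Adj Close": ticker})
    joined = joined.join(col, how="outer")  # align on the date index

joined.to_csv("sp500_joined_closes.csv")
```

The outer join keeps dates on which only some tickers traded, leaving NaN for the others; those rows are handled in the cleaning step.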
We collected 504 stock series, together with the S&P 500 index value, from January 2008 until now (03-14-2019).
Plotting all 500 stocks at once would not give a meaningful comparison graph. Initially, we view the first 5 columns, including the S&P 500 index, over the time series from 2008 to 2019 (now). We will further look at the top 10 by monthly return, the top 10 by correlation with the S&P 500 index, and the top 10 investment stocks from the portfolio optimization.
Offline plotly is used for visualization instead of "matplotlib", because the plot includes too many stocks, and a higher closing price does not equal a higher return on investment, so raw prices are not a valuable basis for comparison.
Checking only the adjusted closing price on a graph does not yield any trend or prediction. Therefore, the following parts conduct portfolio optimization and machine learning analysis to find the strategy with the highest return.
Do the stocks we invest in have a strong relationship with S&P 500 index fluctuations?
Most people think they will get a higher return when the main index goes up, so we can calculate the correlation to check which stocks have a strong relationship with the index and which have a weak one.
We create a new dataframe of the stocks most correlated with the S&P 500 index for further analysis.
If we choose the above stocks for investment, the S&P 500 index will be a highly relevant reference.
Assuming we pick the tickers with the highest correlation to the S&P 500 index (^GSPC), the machine-learned trend will be meaningful and significant for prediction.
For stocks with low correlation to the S&P 500 index, a machine learning model of the index has no predictive value.
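The correlation screen can be sketched like this; the tickers and synthetic series below are illustrative (one column is built to track the index, one is an unrelated random walk):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("2018-01-01", periods=250, freq="B")

# Hypothetical joined closes: "^GSPC" plus two stand-in tickers
index_path = np.cumsum(rng.normal(0, 1, 250))
joined = pd.DataFrame({
    "^GSPC": 2500 + index_path,
    "TRACKER": 100 + index_path + rng.normal(0, 0.1, 250),  # moves with the index
    "NOISE": 50 + np.cumsum(rng.normal(0, 1, 250)),          # unrelated walk
}, index=idx)

# Correlation of every column with the index, sorted descending;
# the project keeps the top 30 (here, just the best besides the index itself)
corr_with_index = joined.corr()["^GSPC"].drop("^GSPC").sort_values(ascending=False)
top = corr_with_index.head(1)
```

The sorted Series makes the cut-off explicit: `head(30)` in the project selects the 30 most index-correlated tickers for the monthly-return comparison.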
A “better” solution, though, would be to plot the information we actually want: the stock’s returns. This involves transforming the data into something more useful for our purposes. There are multiple transformations we could apply.
One transformation would be to consider the stock’s return since the beginning of the period of interest.
I am using a lambda function, which allows me to pass a small, quickly defined function as a parameter to another function or method.
Using the 10-year cumulative return to select higher-return stocks may have some problems: it does not reflect industry trends, because a new industry may have had large returns only in the last few years. A 3-year cumulative return might be more realistic, but at this research stage we keep the 10-year cumulative return.
Risk aside, the return can be 9.8 times over 10 years if investing only in the "IT" ticker. That can certainly be noted as part of an investment strategy.
Modern Portfolio Theory (MPT), a hypothesis put forth by Harry Markowitz in his paper “Portfolio Selection” (published in 1952 in the Journal of Finance), is an investment theory based on the idea that risk-averse investors can construct portfolios to optimize or maximize expected return for a given level of market risk, emphasizing that risk is an inherent part of higher reward. It is one of the most important and influential economic theories dealing with finance and investment.
We want to choose, from the 10 stocks resulting from part 1, those with higher monthly return and lower risk as our final medium- or long-term investment strategy.
We have 10 stocks in our portfolio. One decision we have to make is how to allocate our budget to each stock in the portfolio. If our total budget is 1, then we can decide the weights for each stock so that the sum of the weights is 1. The value of each weight is the portion of the budget we allocate to a specific stock. For example, if the weight is 0.5 for Amazon, it means that we allocate 50% of our budget to Amazon.
“portfolio_annualised_performance” will calculate the returns and volatility, and to make it as an annualised calculation into 252 trading days in one year. “random_portfolios” function will generate portfolios with random weights assigned to each stock, and by giving num_portfolios argument, you can decide how many random portfolios you want to generate.
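A minimal sketch of those two functions, under the 252-trading-day annualisation described below (the toy mean returns and diagonal covariance matrix are illustrative, not project data):

```python
import numpy as np

def portfolio_annualised_performance(weights, mean_returns, cov_matrix):
    # Annualise daily figures using 252 trading days per year
    ret = np.sum(mean_returns * weights) * 252
    vol = np.sqrt(weights.T @ cov_matrix @ weights) * np.sqrt(252)
    return vol, ret

def random_portfolios(num_portfolios, mean_returns, cov_matrix, risk_free_rate):
    n = len(mean_returns)
    results = np.zeros((3, num_portfolios))  # volatility, return, Sharpe
    weights_record = []
    for i in range(num_portfolios):
        w = np.random.random(n)
        w /= np.sum(w)  # normalise so the weights sum to 1
        weights_record.append(w)
        vol, ret = portfolio_annualised_performance(w, mean_returns, cov_matrix)
        results[0, i] = vol
        results[1, i] = ret
        results[2, i] = (ret - risk_free_rate) / vol  # Sharpe ratio
    return results, weights_record

# Toy daily mean returns and covariance for 3 stocks
mean_returns = np.array([0.0005, 0.0007, 0.0004])
cov_matrix = np.diag([0.0001, 0.0002, 0.00015])
results, weights = random_portfolios(1000, mean_returns, cov_matrix, 0.0178)
```

Plotting `results[0]` against `results[1]`, coloured by `results[2]`, gives the cloud of blue dots described later, from which the maximum-Sharpe and minimum-volatility portfolios are picked out.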
First I download daily price data for each of the stocks in the portfolio and convert the daily prices into daily returns. Then the annualised portfolio return and annualised portfolio volatility need to be calculated.
Portfolio standard deviation
The first is the calculation of the portfolio’s volatility in the “portfolio_annualised_performance” function:

σ_p = √( Σ_i Σ_j w_i w_j σ_i σ_j ρ_ij )

This formula can be simplified if we make use of matrix notation, with w the vector of weights and Σ the covariance matrix of returns:

σ_p = √( wᵀ Σ w )

The matrix calculation gives the part inside the square root of the original formula. As with the annualised return, I took 252 trading days into account to calculate the annualised standard deviation of the portfolio, multiplying by √252.
Sharpe ratio
For the Sharpe ratio: risk-adjusted return refines an investment’s return by measuring how much risk is involved in producing that return, generally expressed as a number or rating. There are a number of different methods of expressing risk-adjusted return, and the Sharpe ratio is one of them.
The ratio describes how much excess return you receive for the extra volatility you endure by holding a riskier asset. The Sharpe ratio can be expressed as:

Sharpe ratio = (R_p − R_f) / σ_p

where R_p is the portfolio return, R_f the risk-free rate, and σ_p the portfolio’s standard deviation.
I can get daily returns by calling pct_change on the data frame with the price data. The mean daily returns and the covariance matrix of returns are needed to calculate portfolio returns and volatility. We will generate 25,000 random portfolios. Finally, the risk-free rate is taken from the U.S. Department of the Treasury: 1.78%, the 52-week treasury bill rate at the start of 2018.
This generates random portfolios and records the results (portfolio return, portfolio volatility, portfolio Sharpe ratio) and the weights for each result.
Then, by locating the portfolio with the highest Sharpe ratio, it displays the maximum Sharpe ratio portfolio as a red star.
Similar steps are done for the minimum volatility portfolio, displayed as a green star on the plot.
For the minimum-risk portfolio, we can see that around 30% of our budget is allocated to "COST" - Costco. If you take another look at the daily return plot from earlier, Costco is the least volatile among these stocks, so allocating a large percentage to it in the minimum-risk portfolio makes sense.
If we are willing to take higher risk for higher return, the portfolio with the best risk-adjusted return is the one with the maximum Sharpe ratio. In this scenario, we allocate a significant portion to "SHW" - Sherwin-Williams and "IT" - Gartner Inc, which are quite volatile stocks in the previous plot of daily returns. And "COST" - Costco, which had around 30% in the minimum-risk portfolio, has only a 10% allocation here.
From the plot of the randomly simulated portfolios, we can see that they form an arch-shaped line on top of the cluster of blue dots. This line is called the efficient frontier.
Points along the line give you the lowest risk for a given target return. All the dots to the right of the line give you higher risk for the same return. If the expected returns are the same, the option with lower risk should be taken.
The two kinds of optimal portfolio above were found by simulating many possible random allocations and picking the best ones (either minimum risk or maximum risk-adjusted return).
Scipy’s optimize function does the same task directly, given what to optimize and what the constraints and bounds are.
The functions below find the maximum Sharpe ratio portfolio. Scipy’s optimize module has no ‘maximize’, so the objective must be something to be minimized: “neg_sharpe_ratio” computes the negative Sharpe ratio, and we minimize that.
In “max_sharpe_ratio” function:
constraints = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1})
The above constraint is saying that sum of x should be equal to 1.
The condition np.sum(x) == 1 becomes np.sum(x) - 1, which the optimizer drives to 0.
It simply means that the sum of all the weights must equal 1: no more than 100% of the budget can be allocated in total.
“bounds” gives another limit when assigning weights: any weight must be inclusively between 0 and 1. You cannot give a negative budget allocation to a stock, or more than a 100% allocation to a single stock.
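Putting the constraint, bounds, and negative-Sharpe objective together, a minimal sketch of the optimization might look like this (toy mean returns and covariance, illustrative only):

```python
import numpy as np
from scipy.optimize import minimize

def portfolio_annualised_performance(weights, mean_returns, cov_matrix):
    ret = np.sum(mean_returns * weights) * 252
    vol = np.sqrt(weights.T @ cov_matrix @ weights) * np.sqrt(252)
    return vol, ret

def neg_sharpe_ratio(weights, mean_returns, cov_matrix, risk_free_rate):
    vol, ret = portfolio_annualised_performance(weights, mean_returns, cov_matrix)
    return -(ret - risk_free_rate) / vol  # minimize the negative => maximize Sharpe

def max_sharpe_ratio(mean_returns, cov_matrix, risk_free_rate):
    n = len(mean_returns)
    constraints = ({"type": "eq", "fun": lambda x: np.sum(x) - 1})  # weights sum to 1
    bounds = tuple((0.0, 1.0) for _ in range(n))  # no shorting, no leverage
    return minimize(neg_sharpe_ratio, n * [1.0 / n],
                    args=(mean_returns, cov_matrix, risk_free_rate),
                    method="SLSQP", bounds=bounds, constraints=constraints)

mean_returns = np.array([0.0005, 0.0007, 0.0004])
cov_matrix = np.diag([0.0001, 0.0002, 0.00015])
opt = max_sharpe_ratio(mean_returns, cov_matrix, 0.0178)
```

`opt.x` holds the optimal weight vector; the minimum-volatility portfolio is found the same way by minimizing the volatility itself instead of the negative Sharpe ratio.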
The efficient frontier is drawn as a line depicting where the efficient portfolios lie for each level of risk.
The first function, “efficient_return”, calculates the most efficient portfolio for a given target return; the second, “efficient_frontier”, takes a range of target returns and computes the efficient portfolio for each return level.
Let’s try to plot the portfolio choices with maximum Sharpe ratio and minimum volatility also with all the randomly generated portfolios.
But this time we do not pick the optimal portfolios from the randomly generated ones; we actually calculate them using Scipy’s ‘minimize’ function. The function below also plots the efficient frontier line.
The slight difference is that Scipy’s “optimize” function allocated no budget at all to Costco in the maximum Sharpe ratio portfolio, while the one chosen from the randomly generated samples had a 0.45% allocation to Costco. There are some differences in the decimal places, but the results are more or less the same.
Instead of plotting every randomly generated portfolio, we can plot each individual stocks on the plot with the corresponding values of each stock’s annual return and annual risk. This way we can see and compare how diversification is lowering the risk by optimising the allocation.
As you can see from the above plot, the stock with the least risk is COST at around 0.22, but its return is only around 0.16. If we are willing to take slightly more risk, around 0.225, portfolio optimisation suggests choosing HD and SHW rather than the riskier IT.
We will work with historical data about the stock prices of a publicly listed company. We will implement a mix of machine learning algorithms to predict the future stock price of this company, starting with simple algorithms like averaging and linear regression, and then move on to advanced techniques like Auto ARIMA and LSTM.
In this technical analysis part, we will use a dataset from Part1 - Data Exploration and load the S&P 500 index for analysis.
To confirm the market trend, we can determine whether it is time to invest in the stocks obtained from Part 3 - portfolio optimization. Because they have the highest correlation with the S&P 500 index, this technical analysis is important: if the model’s trend is upward, it means we can invest at this moment.
The profit or loss calculation is usually determined by the closing price of a stock for the day, hence we will consider the closing price as the target variable. Let’s plot the target variable to understand how it’s shaping up in our data:
In the upcoming sections, we will explore these variables and use different techniques to predict the daily closing price of the stock.
The root-mean-squared error (RMSE) is a measure of how well your model performed. It does this by measuring difference between predicted values and the actual values.
Let’s say you feed a model some input X and your model predicts 10, but the actual value is 5. This difference between your prediction (10) and the actual observation (5) is the error term: (y_prediction - y_actual).
The error term is important because we usually want to minimize the error. In other words, we want our predictions to be as close as possible to the actual values.

But there are different ways we could minimize this error term. We could minimize the squared error. Or minimize the absolute value of the error.
In our case, we take the square root of the mean of the squared differences (RMSE):

RMSE = √( (1/n) Σ_i (ŷ_i − y_i)² )

In our example above, with a single prediction, we would get RMSE = √((10 − 5)²) = 5.
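The metric is a one-liner in numpy; this sketch reproduces the single-point example from the text and a two-point case:

```python
import numpy as np

def rmse(predicted, actual):
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Square the errors, average them, then take the square root
    return np.sqrt(np.mean((predicted - actual) ** 2))

single = rmse([10], [5])      # the example above: error of 5 -> RMSE 5.0
multi = rmse([3, 5], [1, 1])  # sqrt((4 + 16) / 2) = sqrt(10)
```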
‘Average’ is easily one of the most common things we use in our day-to-day lives. For instance, calculating the average marks to determine overall performance, or finding the average temperature of the past few days to get an idea about today’s temperature – these all are routine tasks we do on a regular basis. So this is a good starting point to use on our dataset for making predictions.
The predicted closing price for each day will be the average of a set of previously observed values. Instead of using the simple average, we will be using the moving average technique which uses the latest set of values for each prediction. In other words, for each subsequent step, the predicted values are taken into consideration while removing the oldest observed value from the set.
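The walk-forward scheme just described can be sketched as follows; `moving_average_forecast` and the toy closing prices are illustrative, not the project's exact code:

```python
import numpy as np
import pandas as pd

def moving_average_forecast(train, horizon, window=3):
    """Walk-forward moving-average forecast: each new prediction is the
    mean of the latest `window` values, and each prediction is appended
    to the history (dropping the oldest value from the averaging window)
    before forecasting the next step."""
    history = list(train)
    preds = []
    for _ in range(horizon):
        preds.append(np.mean(history[-window:]))
        history.append(preds[-1])
    return preds

closes = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])
preds = moving_average_forecast(closes, horizon=2, window=3)
```

Note that from the second step onward the window contains earlier predictions, which is why a pure moving average flattens out over longer horizons.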

The most basic machine learning algorithm that can be implemented on this data is linear regression. The linear regression model returns an equation that determines the relationship between the independent variables and the dependent variable.
The equation for linear regression can be written as:

Y = θ₁x₁ + θ₂x₂ + … + θₙxₙ

Here, x1, x2, …, xn represent the independent variables while the coefficients θ1, θ2, …, θn represent the weights.
We will first sort the dataset in ascending order and then create a separate dataset with new features created using add_datepart from the fastai library.
This creates features such as:
‘Year’, ‘Month’, ‘Week’, ‘Day’, ‘Dayofweek’, ‘Dayofyear’, ‘Is_month_end’, ‘Is_month_start’, ‘Is_quarter_end’, ‘Is_quarter_start’, ‘Is_year_end’, and ‘Is_year_start’.
Apart from this, we can add our own set of features that we believe would be relevant for the predictions. For instance, my hypothesis is that the first and last days of the week could potentially affect the closing price of the stock far more than the other days. So I have created a feature that identifies whether a given day is Monday/Friday or Tuesday/Wednesday/Thursday.
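A sketch of that feature setup with plain pandas instead of fastai's add_datepart (the synthetic closes and the 240-day split point are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical daily closes on business days
dates = pd.date_range("2018-01-01", periods=300, freq="B")
df = pd.DataFrame({"Date": dates, "Close": 2500 + np.arange(300) * 0.5})

# Date-derived features, as add_datepart would produce
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Dayofweek"] = df["Date"].dt.dayofweek
# 1 for Monday/Friday, 0 for mid-week days, per the hypothesis in the text
df["mon_fri"] = df["Dayofweek"].isin([0, 4]).astype(int)

features = ["Year", "Month", "Dayofweek", "mon_fri"]
split = 240  # chronological split: first 240 days train, rest validation
model = LinearRegression().fit(df[features][:split], df["Close"][:split])
preds = model.predict(df[features][split:])
val_rmse = np.sqrt(np.mean((preds - df["Close"][split:].values) ** 2))
```

The split is chronological rather than random, since shuffling a time series would leak future information into training.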
The RMSE value is lower than the previous technique, which clearly shows that linear regression has performed better.
Linear regression is a simple technique and quite easy to interpret, but there are a few obvious disadvantages. One problem in using regression algorithms is that the model overfits to the date and month column. Instead of taking into account the previous values from the point of prediction, the model will consider the value from the same date a month ago, or the same date/month a year ago.
Another ML algorithm we can use here is kNN (k-nearest neighbours). Based on the independent variables, kNN finds the similarity between new data points and old data points.
Here is a simple example to explain the concept of KNN:
Consider the height and age of 11 people, with the given features (‘Age’ and ‘Height’):

To determine the weight of ID #11, kNN considers the weights of this ID's nearest neighbours. The weight of ID #11 is predicted to be the average of its neighbours: if we consider three neighbours (k = 3), the weight for ID #11 would be (77 + 72 + 60) / 3 ≈ 69.67 kg.
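The toy computation above can be reproduced with a small hand-rolled kNN regressor; the (Age, Height) values below are made up for the sketch, chosen so that the three nearest neighbours of the query carry weights 77, 72, and 60:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the target for x_new as the mean of its k nearest
    training points (Euclidean distance on the features)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean(y_train[nearest]))

# Illustrative (Age, Height) features and Weight targets.
X = np.array([[40, 4.8], [41, 5.0], [38, 5.2],
              [20, 5.5], [25, 5.9], [30, 5.6]])
y = np.array([72, 77, 60, 40, 45, 55])

# The three closest rows to (40, 5.0) carry weights 72, 77, and 60,
# so the prediction is (77 + 72 + 60) / 3.
print(knn_predict(X, y, np.array([40, 5.0]), k=3))
```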

The training error rate and the validation error rate are two measures we need to assess for different values of k.
The RMSE value of kNN is higher than that of the linear regression model, and the plot shows the same pattern.
Like linear regression, kNN also identified a rising trend from January 2019, since that has been the pattern in past years. We can safely say that regression algorithms have not performed well on this dataset.
ARIMA is a very popular statistical method for time series forecasting. ARIMA models take past values into account to predict future values. There are three important parameters in ARIMA: p (the number of past values used in the autoregressive term), d (the order of differencing), and q (the number of past forecast errors used in the moving-average term).
Parameter tuning for ARIMA consumes a lot of time, so we use auto ARIMA, which automatically selects the combination of (p, d, q) that gives the least error.
Auto ARIMA takes into account the AIC and BIC values generated to determine the best combination of parameters. AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are estimators for comparing models: the lower these values, the better the model.
The auto ARIMA model uses past data to understand the pattern in the time series and, using these values, captures the increasing trend in the series. Although the predictions using this technique are far better than those of the previously implemented machine learning models, they are still not close to the real values.
There are a number of time series techniques that can be implemented on the stock prediction dataset, but most of these techniques require a lot of data preprocessing before fitting the model.
Prophet, designed and pioneered by Facebook, is a time series forecasting library that requires no data preprocessing and is extremely simple to implement. The input for Prophet is a dataframe with two columns: date and target (ds and y).
Prophet tries to capture the seasonality in the past data and works well when the dataset is large.
Prophet (like most time series forecasting techniques) tries to capture the trend and seasonality from past data. The model usually performs well on time series datasets, but fails to live up to its reputation in this case.
As it turns out, stock prices do not have a particular trend or seasonality; they depend heavily on what is currently going on in the market, and thus rise and fall. Hence forecasting techniques like ARIMA, SARIMA, and Prophet do not show good results for this particular problem.
LSTMs are widely used for sequence prediction problems and have proven to be extremely effective. The reason they work so well is because LSTM is able to store past information that is important, and forget the information that is not.
LSTM has three gates: the input gate, which decides which new information to add to the cell state; the forget gate, which decides which information to throw away; and the output gate, which decides which part of the cell state to output.
The LSTM model can be tuned for various parameters such as changing the number of LSTM layers, adding dropout value or increasing the number of epochs.
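A minimal Keras sketch of such an LSTM, trained on sliding windows of a synthetic price series; the window length, layer width, dropout rate, and epoch count below are exactly the kinds of illustrative knobs mentioned above:

```python
import numpy as np
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# Synthetic price series; build windows that predict each close
# from the previous 10 closes.
prices = np.cumsum(np.random.default_rng(1).normal(0.5, 1.0, 200))
window = 10
X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]
X = X.reshape((X.shape[0], window, 1))  # (samples, timesteps, features)

model = Sequential([
    Input(shape=(window, 1)),
    LSTM(32),          # tunable: number and width of LSTM layers
    Dropout(0.2),      # tunable: dropout value
    Dense(1),
])
model.compile(optimizer="adam", loss="mean_squared_error")
model.fit(X, y, epochs=2, batch_size=16, verbose=0)  # tunable: epochs

preds = model.predict(X[-5:], verbose=0)
print(preds.shape)
```

In practice the prices would be min-max scaled before windowing and the data split chronologically into train and validation sets.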
Although the model seems to track the real data well, this is not enough to determine whether the stock price will increase or decrease. Stock prices are affected by news about the company and by other factors such as demonetization or mergers/demergers.
There are certain intangible factors as well which can often be impossible to predict beforehand.
From Part 1, we compared all 500 stocks in the S&P 500 and chose the 30 with the highest correlation to the index. After calculating and sorting the correlations, the average value is around 0.8–0.9.
For the second part, we selected the 10 stocks with the highest monthly return from those 30 by comparing their monthly return percentages. These 10 stocks show higher monthly returns over 10 years together with high correlation to the S&P 500 index: Gartner Inc, Sherwin-Williams, Home Depot, The Cooper Companies, Roper Technologies, SBA Communications, Moody's Corp, Amphenol Corp, Fidelity National Information Services, and Costco Wholesale Corp.
Portfolio optimization was used to generate the proportions of the investment strategy. The stock with the least risk is COST, at around 0.22, but its return is only around 0.16. If we are willing to take slightly more risk (around 0.225), the optimization suggests choosing HD and SHW rather than IT, which carries higher risk.
For the last part, machine learning models for the S&P 500 index were built on the time-series data, including moving average, linear regression, k-nearest neighbours, auto ARIMA, and Long Short Term Memory (LSTM), to generate predictions and compare root-mean-squared errors in order to find the most accurate model. LSTM gave the lowest RMSE, and its prediction is that the index will drop from the present level of around 2899 to 2095, so now is not a good time to invest.
As mentioned above, this alone is not enough to determine whether a stock price will increase or decrease; prices are also driven by news about the company and by factors such as demonetization or mergers/demergers. Therefore, sentiment analysis of financial news using NLTK should also be carried out to improve the accuracy of the machine-learning trend predictions.